We started the project by uploading the data from the 7 CSV files that were uploaded in MySQL.
pacman::p_load(RMySQL)
mysqlconnection = dbConnect(
RMySQL::MySQL(),
dbname='gedata',
host='mysql.fsb.miamioh.edu',
port=3306,
user='fsbstud',
password='fsb4you')
dbListTables(mysqlconnection)
## [1] "engine_data_aic" "engine_data_axm"
## [3] "engine_data_fron" "engine_data_pgt"
## [5] "esn_rul" "lkp_airport_codes_t"
## [7] "manufacturing_sql_by_esn"
Next, we created data frames in R from the data tables hosted in MySQL. We uploaded the data from each CSV file into it’s own data frame.
engine_data_aic= RMySQL::dbReadTable(mysqlconnection, name= "engine_data_aic")
engine_data_axm= RMySQL::dbReadTable(mysqlconnection, name= "engine_data_axm")
engine_data_fron= RMySQL::dbReadTable(mysqlconnection, name= "engine_data_fron")
engine_data_pgt= RMySQL::dbReadTable(mysqlconnection, name= "engine_data_pgt")
esn_rul= RMySQL::dbReadTable(mysqlconnection, name= "esn_rul")
lkp_airport_codes_t= RMySQL::dbReadTable(mysqlconnection, name= "lkp_airport_codes_t")
manufacturing_sql_by_esn= RMySQL::dbReadTable(mysqlconnection, name= "manufacturing_sql_by_esn")
Then, we set each dataset as its own data frame in R.
engine_data= rbind(engine_data_aic, engine_data_axm, engine_data_fron, engine_data_pgt)
Now, we joined the esn_rul table to the table with all of the engine data using an inner join.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
esn_joined_data= inner_join(engine_data, esn_rul, na_matches= 'na') # gives only training data
## Joining, by = "esn"
We then joined the manufacturing data table to the combined table using an inner join.
join_data= inner_join(esn_joined_data, manufacturing_sql_by_esn, na_matches= 'na')
## Joining, by = "esn"
The reasoning for this step is discussed under the “Consistent Data” header, it just needs to be placed here for the dataset to be created properly.
for (i in 1:length(join_data$depart_icao)) {
if(join_data$depart_icao[i] == "")
join_data$depart_icao[i] <- join_data$destination_icao[i-1]
if(join_data$destination_icao[i] == "")
join_data$destination_icao[i] <- join_data$depart_icao[i]}
Finally, we joined the airport codes table to our combined table using a set of inner joins. We joined this table on both the depart_icao and the destination_icao columns since we need the coordinates for both airports in order to calculate the distance between airports.
airport_depart= left_join(join_data, lkp_airport_codes_t, by= c("depart_icao"= "airport_icao"))
airport_dest= left_join(airport_depart, lkp_airport_codes_t, by= c("destination_icao"= "airport_icao"))
airport_final= airport_dest
Next, we used the distHaversine function from the geosphere package to calculate the distance between airports. The answer gave us the distance in meters, so we converted it to miles.
pacman::p_load(swfscMisc)
pacman::p_load(geosphere)
airport_final$distance= distHaversine(p1= cbind(airport_final$longitude.x, airport_final$latitude.x), p2= cbind(airport_final$longitude.y, airport_final$latitude.y))
airport_final$distance= airport_final$distance / 1000
airport_final$distance= convert.distance(airport_final$distance, from = "km", to = "mi")
Next, we ensured that the data was tidy, technically correct, and consistent.
We started by checking for tidiness. Through our observation of the data, we concluded that the datetime column needed to be split into date and time in order for the data to be tidy.
pacman::p_load(lubridate, tidyr, hms)
airport_final$datetime= as_datetime(airport_final$datetime)
Tech_Correct = separate(airport_final, datetime, c("date", "time"), sep = ' ')
Now that we have ensured data tidiness, we started to make sure the data was technically correct. We cleaned the column names using the clean names function and also used glimpse to check for data types. We discovered that we needed to switch the date and time columns to be of the date and time data types. We also switched the names of the latitude and longitude variables to make them easier to read and understand.
pacman::p_load(janitor)
library(janitor)
Tech_Correct = clean_names(Tech_Correct)
glimpse(Tech_Correct)
## Rows: 13,096
## Columns: 45
## $ dataset <chr> "test_FD001", "test_FD001", "test_FD001",…
## $ esn <int> 999120, 999120, 999120, 999120, 999120, 9…
## $ unit <int> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 2…
## $ flight_cycle <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13…
## $ date <chr> "2018-02-01", "2018-02-01", "2018-02-08",…
## $ time <chr> "13:47:42", "18:22:24", "00:46:00", "03:5…
## $ operator <chr> "AIC", "AIC", "AIC", "AIC", "AIC", "AIC",…
## $ depart_icao <chr> "LFBO", "LFBO", "VIDP", "VABB", "VIDP", "…
## $ destination_icao <chr> "LFBO", "VIDP", "VABB", "VIDP", "VABB", "…
## $ hpc_eff_mod <dbl> 0.0011, -0.0001, -0.0015, -0.0022, -0.006…
## $ hpc_flow_mod <dbl> -5e-04, -3e-04, 3e-04, 1e-04, -1e-04, 5e-…
## $ tra <int> 100, 100, 100, 100, 100, 100, 100, 100, 1…
## $ t2 <dbl> 518.67, 518.67, 518.67, 518.67, 518.67, 5…
## $ t24 <dbl> 182.59, 182.56, 182.54, 183.01, 182.73, 1…
## $ t30 <dbl> 1588.38, 1590.73, 1590.70, 1588.91, 1588.…
## $ t50 <dbl> 1397.780, 1405.880, 1401.310, 1403.280, 1…
## $ p2 <dbl> 14.62, 14.62, 14.62, 14.62, 14.62, 14.62,…
## $ p15 <dbl> 21.61, 21.61, 21.60, 21.61, 21.61, 21.61,…
## $ p30 <dbl> 554.43, 553.62, 554.24, 553.79, 553.90, 5…
## $ nf <dbl> 2388.07, 2388.06, 2388.08, 2388.06, 2388.…
## $ nc <dbl> 9072.24, 9059.50, 9064.83, 9058.04, 9064.…
## $ epr <dbl> 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1…
## $ ps30 <dbl> 47.40, 47.32, 47.19, 47.28, 47.42, 47.21,…
## $ phi <dbl> 521.88, 522.03, 521.80, 522.25, 522.17, 5…
## $ nrf <dbl> 2388.06, 2388.06, 2388.03, 2388.06, 2388.…
## $ nrc <dbl> 8147.58, 8151.74, 8146.37, 8144.65, 8146.…
## $ bpr <dbl> 8.3923, 8.4385, 8.4234, 8.3955, 8.4371, 8…
## $ farb <dbl> 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03,…
## $ htbleed <int> 393, 392, 392, 391, 393, 394, 391, 392, 3…
## $ nf_dmd <int> 2388, 2388, 2388, 2388, 2388, 2388, 2388,…
## $ pcnfr_dmd <int> 100, 100, 100, 100, 100, 100, 100, 100, 1…
## $ w31 <dbl> 38.76, 38.90, 38.96, 38.94, 38.94, 39.11,…
## $ w32 <dbl> 23.4970, 23.4240, 23.4460, 23.3412, 23.44…
## $ rul <int> 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 1…
## $ x44321p02_op016_median_first <dbl> 23.25222, 23.25222, 23.25222, 23.25222, 2…
## $ x44321p02_op420_median_first <dbl> 14.54578, 14.54578, 14.54578, 14.54578, 1…
## $ x54321p01_op116_median_first <dbl> 22.15643, 22.15643, 22.15643, 22.15643, 2…
## $ x54321p01_op220_median_first <dbl> 29.89512, 29.89512, 29.89512, 29.89512, 2…
## $ x65421p11_op232_median_first <dbl> 188.632, 188.632, 188.632, 188.632, 188.6…
## $ x65421p11_op630_median_first <dbl> 231.0214, 231.0214, 231.0214, 231.0214, 2…
## $ latitude_x <dbl> 43.635, 43.635, 28.569, 19.092, 28.569, 1…
## $ longitude_x <dbl> 1.368, 1.368, 77.112, 72.866, 77.112, 72.…
## $ latitude_y <dbl> 43.635, 28.569, 19.092, 28.569, 19.092, 1…
## $ longitude_y <dbl> 1.368, 77.112, 72.866, 77.112, 72.866, 76…
## $ distance <dbl> 0.0000, 4211.6346, 708.2600, 708.2600, 70…
colnames(Tech_Correct)[41]= "depart_latitude"
colnames(Tech_Correct)[42]= "depart_longitude"
colnames(Tech_Correct)[43]= "dest_latitude"
colnames(Tech_Correct)[44]= "dest_longitude"
Tech_Correct$date = as.Date(Tech_Correct$date)
Tech_Correct$time= as_hms(Tech_Correct$time)
Finally, we checked our data for consistency. We found that some fields were missing information for their departure and arrival airports. Under further investigation, it seems that these planes were inspected but did not leave the airport, or possibly were delayed or took off and aborted. We ordered the data by esn and flight cycle and realized that the data fields with emoty strings for depart_icao and destination_icao and realized that the destination_icao of the record before and the depart_icao of the record after were the same. We concluded that these planes in fact did not leave the airport for some reason but were still inspected. We use for loops to impute the departure airport information of the flight directly before to get rid of the nulls and allow R to fill in the information for latitude, longitude, and distance (which is 0 for all of the planes we imputed in this case). This imputation allows us to avoid nulls and to understand that these planes did not move but their inspection values still may be relevant for predicting the remaining usueful life of the engines, therefore we opted to keep them in the dataset.
The code we used to do this is under the “Imputing Nulls” header. We had to put it at that step in order for the nulls to correctly be imputed.
pacman::p_load(DataExplorer)
plot_missing(Tech_Correct)
Next, we utilized the pointblank package to create a data validation table for our cleaned data set. In this table, we checked to make sure that each column was of their expected type, each column was not null, and that some column were greater than or equal to 0.
pacman::p_load(pointblank)
act= action_levels(warn_at= 0.01, notify_at= 0.01)
agent= create_agent(tbl= Tech_Correct, actions= act)
agent %>%
col_is_date(columns= 'date') %>%
col_is_posix(columns= 'time') %>%
col_is_integer(columns= vars(esn, unit, flight_cycle, tra, htbleed, nf_dmd, pcnfr_dmd, rul)) %>%
col_is_numeric(columns= vars(hpc_eff_mod, hpc_flow_mod, t2, t24, t30, t50, p2, p15, p30, nf, nc, epr, ps30, phi, nrf, nrc, bpr)) %>%
col_is_numeric(columns= vars(farb, w31, w32, depart_latitude, depart_longitude, dest_latitude, dest_longitude, distance)) %>%
col_is_numeric(columns= vars(x65421p11_op630_median_first, x65421p11_op232_median_first,x54321p01_op220_median_first)) %>%
col_is_numeric(columns= vars(x54321p01_op116_median_first, x44321p02_op420_median_first, x44321p02_op016_median_first)) %>%
col_is_character(columns= vars(dataset, operator, depart_icao, destination_icao)) %>%
col_vals_gte(columns= vars(unit, flight_cycle, rul, distance), value= 0) ->
agent
results= interrogate(agent)
results %>% export_report(filename= 'ge_project_evaluation.html')
## ✔ The agent has been written as `ge_project_evaluation.html`
results
| Pointblank Validation | |||||||||||||
| [2022-11-12|17:56:44]
data frame
Tech_CorrectWARN
0.01
STOP
—
NOTIFY
0.01
|
|||||||||||||
| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | col_is_date()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 2 | col_is_posix()
|
— |
|
✓ |
1 |
00 |
11 |
● |
— |
● |
— | ||
| 3 | col_is_integer()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 4 | col_is_integer()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 5 | col_is_integer()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 6 | col_is_integer()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 7 | col_is_integer()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 8 | col_is_integer()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 9 | col_is_integer()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 10 | col_is_integer()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 11 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 12 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 13 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 14 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 15 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 16 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 17 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 18 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 19 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 20 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 21 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 22 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 23 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 24 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 25 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 26 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 27 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 28 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 29 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 30 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 31 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 32 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 33 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 34 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 35 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 36 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 37 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 38 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 39 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 40 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 41 | col_is_numeric()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 42 | col_is_character()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 43 | col_is_character()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 44 | col_is_character()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 45 | col_is_character()
|
— |
|
✓ |
1 |
11 |
00 |
○ |
— |
○ |
— | ||
| 46 | col_vals_gte()
|
|
✓ |
13K |
13K1 |
00 |
○ |
— |
○ |
— | |||
| 47 | col_vals_gte()
|
|
✓ |
13K |
13K1 |
00 |
○ |
— |
○ |
— | |||
| 48 | col_vals_gte()
|
|
✓ |
13K |
13K1 |
00 |
○ |
— |
○ |
— | |||
| 49 | col_vals_gte()
|
|
✓ |
13K |
13K1 |
10 |
○ |
— |
○ |
||||
| 2022-11-12 17:56:45 EST 1.4 s 2022-11-12 17:56:46 EST | |||||||||||||
glimpse(Tech_Correct)
## Rows: 13,096
## Columns: 45
## $ dataset <chr> "test_FD001", "test_FD001", "test_FD001",…
## $ esn <int> 999120, 999120, 999120, 999120, 999120, 9…
## $ unit <int> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 2…
## $ flight_cycle <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13…
## $ date <date> 2018-02-01, 2018-02-01, 2018-02-08, 2018…
## $ time <time> 13:47:42, 18:22:24, 00:46:00, 03:58:00, …
## $ operator <chr> "AIC", "AIC", "AIC", "AIC", "AIC", "AIC",…
## $ depart_icao <chr> "LFBO", "LFBO", "VIDP", "VABB", "VIDP", "…
## $ destination_icao <chr> "LFBO", "VIDP", "VABB", "VIDP", "VABB", "…
## $ hpc_eff_mod <dbl> 0.0011, -0.0001, -0.0015, -0.0022, -0.006…
## $ hpc_flow_mod <dbl> -5e-04, -3e-04, 3e-04, 1e-04, -1e-04, 5e-…
## $ tra <int> 100, 100, 100, 100, 100, 100, 100, 100, 1…
## $ t2 <dbl> 518.67, 518.67, 518.67, 518.67, 518.67, 5…
## $ t24 <dbl> 182.59, 182.56, 182.54, 183.01, 182.73, 1…
## $ t30 <dbl> 1588.38, 1590.73, 1590.70, 1588.91, 1588.…
## $ t50 <dbl> 1397.780, 1405.880, 1401.310, 1403.280, 1…
## $ p2 <dbl> 14.62, 14.62, 14.62, 14.62, 14.62, 14.62,…
## $ p15 <dbl> 21.61, 21.61, 21.60, 21.61, 21.61, 21.61,…
## $ p30 <dbl> 554.43, 553.62, 554.24, 553.79, 553.90, 5…
## $ nf <dbl> 2388.07, 2388.06, 2388.08, 2388.06, 2388.…
## $ nc <dbl> 9072.24, 9059.50, 9064.83, 9058.04, 9064.…
## $ epr <dbl> 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1…
## $ ps30 <dbl> 47.40, 47.32, 47.19, 47.28, 47.42, 47.21,…
## $ phi <dbl> 521.88, 522.03, 521.80, 522.25, 522.17, 5…
## $ nrf <dbl> 2388.06, 2388.06, 2388.03, 2388.06, 2388.…
## $ nrc <dbl> 8147.58, 8151.74, 8146.37, 8144.65, 8146.…
## $ bpr <dbl> 8.3923, 8.4385, 8.4234, 8.3955, 8.4371, 8…
## $ farb <dbl> 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03,…
## $ htbleed <int> 393, 392, 392, 391, 393, 394, 391, 392, 3…
## $ nf_dmd <int> 2388, 2388, 2388, 2388, 2388, 2388, 2388,…
## $ pcnfr_dmd <int> 100, 100, 100, 100, 100, 100, 100, 100, 1…
## $ w31 <dbl> 38.76, 38.90, 38.96, 38.94, 38.94, 39.11,…
## $ w32 <dbl> 23.4970, 23.4240, 23.4460, 23.3412, 23.44…
## $ rul <int> 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 1…
## $ x44321p02_op016_median_first <dbl> 23.25222, 23.25222, 23.25222, 23.25222, 2…
## $ x44321p02_op420_median_first <dbl> 14.54578, 14.54578, 14.54578, 14.54578, 1…
## $ x54321p01_op116_median_first <dbl> 22.15643, 22.15643, 22.15643, 22.15643, 2…
## $ x54321p01_op220_median_first <dbl> 29.89512, 29.89512, 29.89512, 29.89512, 2…
## $ x65421p11_op232_median_first <dbl> 188.632, 188.632, 188.632, 188.632, 188.6…
## $ x65421p11_op630_median_first <dbl> 231.0214, 231.0214, 231.0214, 231.0214, 2…
## $ depart_latitude <dbl> 43.635, 43.635, 28.569, 19.092, 28.569, 1…
## $ depart_longitude <dbl> 1.368, 1.368, 77.112, 72.866, 77.112, 72.…
## $ dest_latitude <dbl> 43.635, 28.569, 19.092, 28.569, 19.092, 1…
## $ dest_longitude <dbl> 1.368, 77.112, 72.866, 77.112, 72.866, 76…
## $ distance <dbl> 0.0000, 4211.6346, 708.2600, 708.2600, 70…
write.csv(Tech_Correct, file = "Cleaned_Data.csv", row.names = FALSE)